Justin Dano
FE550 - Data Visualization Applications
Final Project
The value of cryptocurrencies such as Bitcoin has been known to fluctuate wildly, and these price variations are far more volatile than those of any traditional financial asset in today's market. The aim of this project is to determine how much news articles affect the price of Bitcoin. To accomplish this goal, I will use data visualizations and perform sentiment analysis on a dataset of recent news articles using Naive Bayes Classification.
The motivation for this project is to determine whether news sentiment can be used to predict the price of cryptocurrencies. Another source of motivation is to try to understand the causes of such volatile price movements. Ultimately, if predicting the future price of Bitcoin from news sentiment proves plausible, it could be built into an investment strategy. While the project is currently focused on Bitcoin, it is quite easy to extend the functionality to other cryptocurrencies. The overall aim of this project is to attempt to answer the following research questions:
1. How does the news directly impact the immediate price of Bitcoin?
2. Is the news the driving factor in how Bitcoin is priced?
3. Can news sentiment be used to accurately predict the future price of Bitcoin?
Python 3.6.1
Anaconda 3-4.4.0
Pandas 0.20.3
NLTK 3.2.5
Plotly 2.2.3
Developed on a Jupyter notebook.
This project has three sections:
To begin, both Bitcoin price data and news articles related to Bitcoin need to be extracted. Next, both datasets need some cleaning before the sentiment of the news articles can be analyzed. To determine whether a particular news article has positive or negative sentiment, the price of Bitcoin at the time the article is published is compared with the price of Bitcoin at a pre-defined offset time. This analysis uses an offset of 1 minute after publication. So if the price of Bitcoin shows an up-tick one minute after the article was published, the article is labeled as positive sentiment, and vice versa. To build the classifier, I used a naive Bayes classifier (from the NLTK library) based on the words in the description of each article (more on this below). The results are saved as a .csv file and finally used here in the Jupyter notebook for visualization.
Below I have included the general workflow of the project, with an additional summary for each component:
from IPython.display import Image
Image(filename='final_project_pipeline.png')
import pandas as pd
pd.read_csv('coinbaseUSD.csv', names=['timestamp', 'price', 'amount']).tail()
pd.read_csv('news.csv', names=['author', 'description', 'popularity', 'published_at', 'source', 'title', 'url', 'url_image','nc', 'scraping_date']).tail()
pd.read_csv('bitcoin_data_min_tick.csv', names=['timestamp', 'price', 'amount']).tail()
pd.read_csv('news_data_min_tick.csv').tail()
The script sentimentAnalyzer.py performs the sentiment analysis on the news articles. Since the code is somewhat involved and hundreds of lines long, I have omitted it here. For more info on Naive Bayes Classification, please visit here or view the source code directly. I will, however, describe in detail how classification is done for this project, as well as the general process for building the predictive model.
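Although the script itself is omitted, the feature-extraction step that a bag-of-words classifier depends on can be sketched briefly. This is a minimal illustration and not the code from sentimentAnalyzer.py: the function name `description_features`, the tokenizing regular expression, and the two sample descriptions are all assumptions made for the example.

```python
import re

def description_features(description):
    """Turn an article description into a bag-of-words feature dict.

    Hypothetical helper: each lower-cased word becomes a boolean feature.
    """
    words = re.findall(r"[a-z']+", description.lower())
    return {"contains({})".format(word): True for word in words}

# Made-up (description, label) pairs; 1 = positive, -1 = negative
labelled = [
    ("Bitcoin surges to a new high", 1),
    ("Exchange hacked, bitcoin falls", -1),
]

# A list of (feature dict, label) pairs, which is the input shape that
# nltk.NaiveBayesClassifier.train() accepts
training_set = [(description_features(text), label) for text, label in labelled]
```

With NLTK installed, `training_set` could be passed to `nltk.NaiveBayesClassifier.train`, and `classify` could then be called on the feature dict of an unseen description.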
How do we determine what is positive or negative sentiment? NLTK offers some resources that can determine sentiment with a good degree of accuracy, such as Sentinet. For example, if an article has words like "good", "great", or "awesome", it might be considered positive sentiment.
This was not the approach I took, however. Instead, I built the classifier's labels by comparing how the price of Bitcoin changed over a pre-determined time-frame after publication. If the price of Bitcoin increased over that time-frame, the article is considered positive sentiment. If the price decreased, it is considered negative sentiment. So the initial classification is done irrespective of the words in the actual article, and is based only on how the price changed after the article was published.
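The labeling rule described above can be sketched in a few lines. This is a simplified stand-in for the real script, under stated assumptions: the name `label_article`, the dict-based price lookup, the label 0 for an unchanged price, and the sample prices are hypothetical.

```python
from datetime import datetime, timedelta

def label_article(prices, published_at, offset=timedelta(minutes=1)):
    """Label one article by the price move over `offset` after publication.

    prices: dict mapping minute-resolution datetimes to the Bitcoin price.
    Returns 1 for an up-tick, -1 for a down-tick, 0 for no change
    (an assumption), or None when either price is missing.
    """
    price_at_publication = prices.get(published_at)
    price_after_offset = prices.get(published_at + offset)
    if price_at_publication is None or price_after_offset is None:
        return None
    if price_after_offset > price_at_publication:
        return 1
    if price_after_offset < price_at_publication:
        return -1
    return 0

# Made-up minute prices, for illustration only
prices = {
    datetime(2017, 10, 12, 7, 35): 5000.0,
    datetime(2017, 10, 12, 7, 36): 4990.0,
    datetime(2017, 10, 12, 7, 37): 5010.0,
}
published = datetime(2017, 10, 12, 7, 35)
label_1min = label_article(prices, published)
label_2min = label_article(prices, published, offset=timedelta(minutes=2))
```

In this made-up series the same article is labeled negative with a one-minute offset and positive with a two-minute offset, which shows how sensitive the labels are to the choice of offset.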
Note: Sentiment and Prediction are the last columns of the dataframe
pd.read_csv('news_sentiment_predictions.csv').tail()
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.plotly as py
import plotly.graph_objs as go
from plotly import tools
init_notebook_mode(connected=True)
The following cell essentially performs a left join of the Bitcoin data with the news sentiment data. For each timestamp at which a news article was published, the article's relevant details are added to the dataframe.
# Get Bitcoin Data
bitcoin_data = pd.read_csv('bitcoin_data_min_tick.csv')
bitcoin_data = bitcoin_data.set_index('timestamp', drop=False)
# Get News Articles that were part of Training Dataset
training_news = pd.read_csv('training_news.csv')
training_news = training_news.set_index('published_at', drop=False)
# Format sentiment (float -> Int)
training_news.sentiment = training_news.sentiment.astype(int)
bitcoin_data['article_published'] = 0
# Adds news data to rows (timestamps) that also have an article published
bitcoin_data.loc[training_news.index, 'article_published'] = 1
bitcoin_data.loc[training_news.index, 'sentiment'] = training_news.sentiment
bitcoin_data.loc[training_news.index, 'title'] = training_news.title
bitcoin_data.loc[training_news.index, 'url'] = training_news.url
bitcoin_data.loc[training_news.index, 'source'] = training_news.source
bitcoin_data.loc[training_news.index, 'author'] = training_news.author
bitcoin_data.head()
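For comparison, the same left join could be written with `DataFrame.merge`. This is a sketch on made-up rows, not a replacement for the cell above: the frame names `prices` and `news` and their contents are assumptions, and only a few of the news columns are shown.

```python
import pandas as pd

# Made-up minute prices and a single article, for illustration only
prices = pd.DataFrame({
    "timestamp": ["2017-10-12 07:34:00", "2017-10-12 07:35:00", "2017-10-12 07:36:00"],
    "price": [5001.0, 5000.0, 4990.0],
})
news = pd.DataFrame({
    "published_at": ["2017-10-12 07:35:00"],
    "sentiment": [-1],
    "title": ["Example headline"],
})

# Left join: every price row is kept; news columns are NaN where no article matched
merged = prices.merge(news, how="left", left_on="timestamp", right_on="published_at")

# Flag the rows that picked up an article
merged["article_published"] = merged["published_at"].notna().astype(int)
```

One difference worth noting: when several articles share a timestamp, `merge` yields one row per article, so the price row is repeated.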
def plot_bitcoin_and_sentiment_range(sentiment, start_day, end_day=None):
    title = None
    legend_name = None
    sentiment_color = None
    # Setup the title and colors based on sentiment and end_day parameter
    if sentiment == 1:
        if end_day is None:
            title = 'Price of Bitcoin on {} with Positive News Sentiment'.format(start_day)
        else:
            title = 'Price of Bitcoin between {} and {} with Positive News Sentiment'.format(start_day, end_day)
        legend_name = 'Positive News'
        sentiment_color = 'green'
    elif sentiment == -1:
        if end_day is None:
            title = 'Price of Bitcoin on {} with Negative News Sentiment'.format(start_day)
        else:
            title = 'Price of Bitcoin between {} and {} with Negative News Sentiment'.format(start_day, end_day)
        legend_name = 'Negative News'
        sentiment_color = 'red'
    # Filter bitcoin time-series based on date
    if end_day is None:
        plot_bitcoin_data = bitcoin_data[bitcoin_data['timestamp'].str.contains(start_day)]
    else:
        plot_bitcoin_data = bitcoin_data[(bitcoin_data['timestamp'] >= start_day)
                                         & (bitcoin_data['timestamp'] <= end_day)]
    # Plot Bitcoins Price
    trace0 = go.Scatter(
        x=plot_bitcoin_data.timestamp,
        y=plot_bitcoin_data.price,
        name='USD/BTC')
    # Specifically used to plot the news articles on the time series
    news_sentiment = plot_bitcoin_data[plot_bitcoin_data['sentiment'] == sentiment]
    # Plot Points of News Articles
    trace1 = go.Scatter(
        x=news_sentiment.timestamp,
        y=news_sentiment.price,
        mode='markers',
        name=legend_name,
        marker=dict(size=5, line=dict(width=1), color=sentiment_color),
        hoverinfo='none',
        textposition='top')
    # Plot news Info
    trace2 = go.Scatter(
        x=news_sentiment.timestamp,
        y=[min(plot_bitcoin_data.price) - 10]*len(news_sentiment.timestamp),
        text='Title: ' + news_sentiment.loc[news_sentiment.timestamp]['title'] + '<br>' +
             'Source: ' + news_sentiment.loc[news_sentiment.timestamp]['source'] + '<br>' +
             'Author: ' + news_sentiment.loc[news_sentiment.timestamp]['author'],
        showlegend=False,
        marker=dict(color=sentiment_color)
    )
    plot_data = [trace0, trace1, trace2]
    plot_layout = go.Layout(title=title,
                            titlefont=dict(family='Courier New, monospace', size=20),
                            xaxis=dict(
                                title='Time',
                                titlefont=dict(family='Courier New, monospace', size=18)),
                            yaxis=dict(
                                title='USD/BTC',
                                titlefont=dict(family='Courier New, monospace', size=18))
                            )
    fig = dict(data=plot_data, layout=plot_layout)
    iplot(fig)
The entire history of news sentiment runs from September 21st, 2017 to November 11th, 2017. This is largely due to the limitations of the News API. The training data covers September 21st through October 23rd. Our first two visualizations show the entire universe of training data for both negative and positive news sentiment.
plot_bitcoin_and_sentiment_range(-1, '2017-09-21', '2017-10-23')
plot_bitcoin_and_sentiment_range(1, '2017-09-21', '2017-10-23')
While these graphs may put you in the Christmas spirit, they don't really provide any useful information. Essentially, the graph is too small for the number of news articles published. Luckily, the method was written so that it can be narrowed down to a specific day through its arguments. By analyzing a specific day, we can learn a lot more about the news sentiment.
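One detail of the range filter in the plotting function is worth a small sketch. If the timestamps are stored as strings such as '2017-10-23 00:00:00' (an assumption about the CSV format), a string comparison against '2017-10-23' leaves out every row from that final day. The helper `filter_days` below is hypothetical and shows a datetime-based alternative that includes the end day in full.

```python
import pandas as pd

# Made-up rows, for illustration only
sample = pd.DataFrame({
    "timestamp": ["2017-10-22 23:59:00", "2017-10-23 00:00:00",
                  "2017-10-23 12:30:00", "2017-10-24 00:00:00"],
    "price": [5900.0, 5905.0, 5850.0, 5800.0],
})

# String comparison: '2017-10-23 00:00:00' sorts after '2017-10-23',
# so the rows from the end day are dropped
as_strings = sample[(sample["timestamp"] >= "2017-10-22")
                    & (sample["timestamp"] <= "2017-10-23")]

def filter_days(frame, start_day, end_day=None):
    """Keep rows from start_day through end_day, both days included in full."""
    stamps = pd.to_datetime(frame["timestamp"])
    start = pd.Timestamp(start_day)
    end = pd.Timestamp(end_day or start_day) + pd.Timedelta(days=1)
    return frame[(stamps >= start) & (stamps < end)]

two_days = filter_days(sample, "2017-10-22", "2017-10-23")
one_day = filter_days(sample, "2017-10-23")
```

Here `as_strings` keeps only the first row, while `two_days` keeps the three rows that fall on October 22nd and 23rd.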
plot_bitcoin_and_sentiment_range(-1, '2017-10-12')
Each dot on the time-series represents a news article that was followed by a down-tick in the Bitcoin price. A weakness of this labeling is the arbitrary choice of 1 minute after the article is released. Take for example the article at around 7:35am, titled "The Blockchain Intersection with Supply Chain Data". The price did decrease initially after the article was released, only to be followed by a huge spike. One might speculate that this article actually carried positive sentiment and that the analysis simply considered the wrong time frame. Another possible explanation is that a different article with positive sentiment was released shortly after the one with negative sentiment. Let us look at the same day, but filter on positive sentiment.
plot_bitcoin_and_sentiment_range(1, '2017-10-12')
As suspected, there is indeed another news article just before the huge spike at around 8:00am. Unfortunately, we cannot conclude that the article directly caused the price swing. A better approach would be to look for patterns between when news articles are published and how the price changes. While not perfect, it does appear that positive news sentiment can be found before most price hikes. Next we will combine the articles with negative and positive sentiment to get a clearer picture of how news may be affecting the price of Bitcoin.
def plot_bitcoin_and_both_sentiment_range(start_day, end_day=None):
    if end_day is None:
        title = 'Price of Bitcoin on {} with News Sentiment'.format(start_day)
    else:
        title = 'Price of Bitcoin between {} and {} with News Sentiment'.format(start_day, end_day)
    # Filter bitcoin time-series based on date
    if end_day is None:
        plot_bitcoin_data = bitcoin_data[bitcoin_data['timestamp'].str.contains(start_day)]
    else:
        plot_bitcoin_data = bitcoin_data[(bitcoin_data['timestamp'] >= start_day)
                                         & (bitcoin_data['timestamp'] <= end_day)]
    # Plot Bitcoins Price
    trace0 = go.Scatter(
        x=plot_bitcoin_data.timestamp,
        y=plot_bitcoin_data.price,
        name='USD/BTC')
    # Specifically used to plot the news articles on the time series
    pos_news_sentiment = plot_bitcoin_data[plot_bitcoin_data['sentiment'] == 1]
    neg_news_sentiment = plot_bitcoin_data[plot_bitcoin_data['sentiment'] == -1]
    # Plot points of positive news articles (green)
    pos_trace = go.Scatter(
        x=pos_news_sentiment.timestamp,
        y=pos_news_sentiment.price,
        mode='markers',
        name='Positive News',
        marker=dict(size=5, line=dict(width=1), color='green'),
        hoverinfo='none',
        textposition='top')
    # Plot points of negative news articles (red)
    neg_trace = go.Scatter(
        x=neg_news_sentiment.timestamp,
        y=neg_news_sentiment.price,
        mode='markers',
        name='Negative News',
        marker=dict(size=5, line=dict(width=1), color='red'),
        hoverinfo='none',
        textposition='top')
    # Combine positive and negative articles for the hover text
    all_news = pd.concat([pos_news_sentiment, neg_news_sentiment])
    # Plot news info (title, source, author) along the bottom of the chart
    news_info = go.Scatter(
        x=all_news.timestamp,
        y=[min(plot_bitcoin_data.price) - 10]*len(all_news.timestamp),
        text='Title: ' + all_news.loc[all_news.timestamp]['title'] + '<br>' +
             'Source: ' + all_news.loc[all_news.timestamp]['source'] + '<br>' +
             'Author: ' + all_news.loc[all_news.timestamp]['author'],
        showlegend=False,
        marker=dict(color='#1F77B4')
    )
    plot_data = [trace0, pos_trace, neg_trace, news_info]
    plot_layout = go.Layout(title=title,
                            titlefont=dict(family='Courier New, monospace', size=20),
                            xaxis=dict(
                                title='Time',
                                titlefont=dict(family='Courier New, monospace', size=18)),
                            yaxis=dict(
                                title='USD/BTC',
                                titlefont=dict(family='Courier New, monospace', size=18))
                            )
    fig = dict(data=plot_data, layout=plot_layout)
    iplot(fig)
Now, with both positive and negative sentiment plotted on the time-series, we can visualize a more complete picture of how different news articles may affect the change in price. I have included a few samples below.
plot_bitcoin_and_both_sentiment_range('2017-10-12')
plot_bitcoin_and_both_sentiment_range('2017-10-13')
plot_bitcoin_and_both_sentiment_range('2017-10-13', '2017-10-18')
All previous visualizations were of the training data used to create the Naive Bayes Classifier. Now let us visualize the results of predicting the sentiment of news articles.
news_data_with_predictions = pd.read_csv('news_sentiment_predictions.csv')
news_data_with_predictions = news_data_with_predictions.set_index('published_at', drop=False)
news_data_with_predictions.sentiment = news_data_with_predictions.sentiment.astype(int)
bitcoin_data = pd.read_csv('bitcoin_data_min_tick.csv')
bitcoin_data = bitcoin_data.set_index('timestamp', drop=False)
# Combine Bitcoin Data with News Sentiment
bitcoin_data['article_published'] = 0
bitcoin_data.loc[news_data_with_predictions.index, 'article_published'] = 1
bitcoin_data.loc[news_data_with_predictions.index, 'sentiment'] = news_data_with_predictions.sentiment
bitcoin_data.loc[news_data_with_predictions.index, 'predicted'] = news_data_with_predictions.predicted
bitcoin_data.loc[news_data_with_predictions.index, 'title'] = news_data_with_predictions.title
bitcoin_data.loc[news_data_with_predictions.index, 'source'] = news_data_with_predictions.source
bitcoin_data.loc[news_data_with_predictions.index, 'author'] = news_data_with_predictions.author
def plot_bitcoin_and_predicted_sentiment(side, start_day, end_day=None):
    if end_day is None:
        title = 'Price of Bitcoin on {} with Predicted News Sentiment'.format(start_day)
    else:
        title = 'Price of Bitcoin between {} and {} with Predicted News Sentiment'.format(start_day, end_day)
    # Filter bitcoin time-series based on date
    if end_day is None:
        plot_bitcoin_data = bitcoin_data[bitcoin_data['timestamp'].str.contains(start_day)]
    else:
        plot_bitcoin_data = bitcoin_data[(bitcoin_data['timestamp'] >= start_day)
                                         & (bitcoin_data['timestamp'] <= end_day)]
    # Specifically used to plot the news articles on the time series
    news_sentiment = plot_bitcoin_data[plot_bitcoin_data['article_published'] == 1]
    # Select correct or incorrect predictions; the marker shape follows `side`
    if side == 'correct':
        symbol = 'circle'
        pos_sentiment = news_sentiment[(news_sentiment.predicted == 1) & (news_sentiment.sentiment == 1)]
        neg_sentiment = news_sentiment[(news_sentiment.predicted == -1) & (news_sentiment.sentiment == -1)]
        pos_title = 'Correctly Predicted Positive Sentiment'
        neg_title = 'Correctly Predicted Negative Sentiment'
    elif side == 'incorrect':
        symbol = 'x'
        pos_sentiment = news_sentiment[(news_sentiment.predicted == 1) & (news_sentiment.sentiment != 1)]
        neg_sentiment = news_sentiment[(news_sentiment.predicted == -1) & (news_sentiment.sentiment != -1)]
        pos_title = 'Incorrectly Predicted Positive Sentiment'
        neg_title = 'Incorrectly Predicted Negative Sentiment'
    else:
        raise ValueError("side must be 'correct' or 'incorrect'")
    # Plot Bitcoins Price
    bitcoin_price_trace = go.Scatter(
        x=plot_bitcoin_data.timestamp,
        y=plot_bitcoin_data.price,
        name='USD/BTC')
    # Plot predicted-positive points (green; O if correct, x if incorrect)
    pos_trace = go.Scatter(
        x=pos_sentiment.timestamp,
        y=pos_sentiment.price,
        mode='markers',
        name=pos_title,
        marker=dict(symbol=symbol, size=7, line=dict(width=1), color='green'),
        hoverinfo='none')
    # Plot predicted-negative points (red; O if correct, x if incorrect)
    neg_trace = go.Scatter(
        x=neg_sentiment.timestamp,
        y=neg_sentiment.price,
        mode='markers',
        name=neg_title,
        marker=dict(symbol=symbol, size=7, line=dict(width=1), color='red'),
        hoverinfo='none')
    # Plot news Info
    all_news = pd.concat([pos_sentiment, neg_sentiment])
    news_info_trace = go.Scatter(
        x=all_news.timestamp,
        y=[min(plot_bitcoin_data.price) - 10]*len(all_news.timestamp),
        text='Title: ' + all_news.loc[all_news.timestamp].title + '<br>' +
             'Source: ' + all_news.loc[all_news.timestamp].source + '<br>' +
             'Author: ' + all_news.loc[all_news.timestamp].author + '<br>' +
             'Actual Sentiment: ' + all_news.loc[all_news.timestamp].sentiment.astype(str) + '<br>' +
             'Predicted Sentiment: ' + all_news.loc[all_news.timestamp].predicted.astype(str),
        showlegend=False,
        marker=dict(color='#1F77B4')
    )
    plot_data = [bitcoin_price_trace, pos_trace, neg_trace, news_info_trace]
    plot_layout = go.Layout(title=title,
                            titlefont=dict(family='Courier New, monospace', size=20),
                            xaxis=dict(
                                title='Time',
                                titlefont=dict(family='Courier New, monospace', size=18)),
                            yaxis=dict(
                                title='USD/BTC',
                                titlefont=dict(family='Courier New, monospace', size=18))
                            )
    fig = dict(data=plot_data, layout=plot_layout)
    iplot(fig)
Now we have a similar method as before, but this time we will only be visualizing correct or incorrect news sentiment predictions. I have included a few samples below, first over a period of a few days, followed by a specific analysis of October 24th.
plot_bitcoin_and_predicted_sentiment('correct', '2017-10-25', '2017-11-01')
Now let us examine the same timeframe for news sentiment that was incorrectly predicted.
plot_bitcoin_and_predicted_sentiment('incorrect', '2017-10-25', '2017-11-01')
As before, it is somewhat difficult to gain an intuition for the news sentiment given the size of the graph relative to the number of data points. So for now we will focus on a single day, October 24th, 2017.
plot_bitcoin_and_predicted_sentiment('correct', '2017-10-24')
While not perfect, it does appear that most articles sit right next to large changes in price. Even if the sentiment is not completely accurate, this suggests that news can affect the immediate price of Bitcoin. Now let us examine some incorrect predictions.
plot_bitcoin_and_predicted_sentiment('incorrect', '2017-10-24')
As with the training data, it is somewhat hard to see the whole picture without all of the news articles in one visualization. Our final visualization combines all sentiment predictions in a single chart.
def plot_bitcoin_and_predicted_sentiment(start_day, end_day=None):
    if end_day is None:
        title = 'Price of Bitcoin on {} with Predicted News Sentiment'.format(start_day)
    else:
        title = 'Price of Bitcoin between {} and {} with Predicted News Sentiment'.format(start_day, end_day)
    # Filter bitcoin time-series based on date
    if end_day is None:
        plot_bitcoin_data = bitcoin_data[bitcoin_data['timestamp'].str.contains(start_day)]
    else:
        plot_bitcoin_data = bitcoin_data[(bitcoin_data['timestamp'] >= start_day)
                                         & (bitcoin_data['timestamp'] <= end_day)]
    # Specifically used to plot the news articles on the time series
    news_sentiment = plot_bitcoin_data[plot_bitcoin_data['article_published'] == 1]
    pos_cor_sentiment = news_sentiment[(news_sentiment.predicted == 1) & (news_sentiment.sentiment == 1)]
    pos_incor_sentiment = news_sentiment[(news_sentiment.predicted == 1) & (news_sentiment.sentiment != 1)]
    neg_cor_sentiment = news_sentiment[(news_sentiment.predicted == -1) & (news_sentiment.sentiment == -1)]
    neg_incor_sentiment = news_sentiment[(news_sentiment.predicted == -1) & (news_sentiment.sentiment != -1)]
    # Plot Bitcoins Price
    bitcoin_price_trace = go.Scatter(
        x=plot_bitcoin_data.timestamp,
        y=plot_bitcoin_data.price,
        name='USD/BTC')
    # Plot positive sentiment points that were correct (green O)
    pos_correct_trace = go.Scatter(
        x=pos_cor_sentiment.timestamp,
        y=pos_cor_sentiment.price,
        mode='markers',
        name='Correctly Predicted Positive Sentiment',
        marker=dict(size=7, line=dict(width=1), color='green'),
        hoverinfo='none')
    # Plot positive sentiment points that were incorrect (green x)
    pos_incorrect_trace = go.Scatter(
        x=pos_incor_sentiment.timestamp,
        y=pos_incor_sentiment.price,
        mode='markers',
        name='Incorrectly Predicted Positive Sentiment',
        marker=dict(symbol="x", size=7, line=dict(width=1), color='green'),
        hoverinfo='none')
    # Plot negative sentiment points that were correct (red O)
    neg_correct_trace = go.Scatter(
        x=neg_cor_sentiment.timestamp,
        y=neg_cor_sentiment.price,
        mode='markers',
        name='Correctly Predicted Negative Sentiment',
        marker=dict(size=7, line=dict(width=1), color='red'),
        hoverinfo='none')
    # Plot negative sentiment points that were incorrect (red x)
    neg_incorrect_trace = go.Scatter(
        x=neg_incor_sentiment.timestamp,
        y=neg_incor_sentiment.price,
        mode='markers',
        name='Incorrectly Predicted Negative Sentiment',
        marker=dict(symbol="x", size=7, line=dict(width=1), color='red'),
        hoverinfo='none')
    # Plot news Info
    news_info_trace = go.Scatter(
        x=news_sentiment.timestamp,
        y=[min(plot_bitcoin_data.price) - 10]*len(news_sentiment.timestamp),
        text='Title: ' + news_sentiment.loc[news_sentiment.timestamp].title + '<br>' +
             'Source: ' + news_sentiment.loc[news_sentiment.timestamp].source + '<br>' +
             'Author: ' + news_sentiment.loc[news_sentiment.timestamp].author + '<br>' +
             'Actual Sentiment: ' + news_sentiment.loc[news_sentiment.timestamp].sentiment.astype(str) + '<br>' +
             'Predicted Sentiment: ' + news_sentiment.loc[news_sentiment.timestamp].predicted.astype(str),
        showlegend=False,
        marker=dict(color='#1F77B4')
    )
    plot_data = [bitcoin_price_trace,
                 pos_correct_trace, pos_incorrect_trace,
                 neg_correct_trace, neg_incorrect_trace,
                 news_info_trace]
    plot_layout = go.Layout(title=title,
                            titlefont=dict(family='Courier New, monospace', size=20),
                            xaxis=dict(
                                title='Time',
                                titlefont=dict(family='Courier New, monospace', size=18)),
                            yaxis=dict(
                                title='USD/BTC',
                                titlefont=dict(family='Courier New, monospace', size=18))
                            )
    fig = dict(data=plot_data, layout=plot_layout)
    iplot(fig)
plot_bitcoin_and_predicted_sentiment('2017-10-24')
plot_bitcoin_and_predicted_sentiment('2017-10-30')
After reviewing the visualizations above, I find it hard to believe that the current Naive Bayes classifier has good predictive power. It appears that there are just as many X's (incorrect predictions) as there are O's (correct predictions). To verify, I built a method to determine the actual accuracy of the classifier. The statistics shown below are the results from the Naive Bayes Classifier. Overall, its accuracy was only 51.64%. An interesting finding is that it was considerably better at predicting positive sentiment (58.54%) than negative sentiment (a mere 43.67%).
Image(filename='naiveBayes_results.png')
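Accuracy figures of the kind reported above could be computed along the following lines. This is a sketch, not the method used in the project: the function name `accuracy_report` and the sample labels are hypothetical, and the per-class figure is assumed to mean the share of actual positives (or negatives) that were predicted correctly.

```python
def accuracy_report(actual, predicted):
    """Return overall and per-class accuracy for labels 1 and -1."""
    pairs = list(zip(actual, predicted))

    def share_correct(rows):
        # Fraction of rows whose prediction matches the actual label
        return sum(a == p for a, p in rows) / len(rows) if rows else float("nan")

    return {
        "overall": share_correct(pairs),
        "positive": share_correct([(a, p) for a, p in pairs if a == 1]),
        "negative": share_correct([(a, p) for a, p in pairs if a == -1]),
    }

# Made-up labels, for illustration only
actual = [1, 1, 1, -1, -1, 1]
predicted = [1, -1, 1, -1, 1, 1]
report = accuracy_report(actual, predicted)
```

Applied to the project data, the inputs would be the `sentiment` and `predicted` columns of news_sentiment_predictions.csv.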
One of the main obstacles of this project was figuring out a way to provide reproducible interactivity. Initially I planned to make this an interactive visualization where the user could make choices and produce a unique data visualization. Unfortunately, once the Jupyter notebook is exported to HTML, all widgets and event listeners are converted to static code, rendering them useless. The alternative was to have the end user download the .ipynb file, install all the required Python modules, and run the notebook. This seemed like too much work, so I decided to create a product that could be completely rendered in HTML. If I were to do this all over, I would have used a completely different technology stack for interactive visualizations, such as an RShiny application or a full-blown web-app in D3.
1) One improvement would be to develop a full naive Bayes classifier by extracting the entire article (instead of just the description). This was my initial design, but building a classifier from scratch with the entire set of words from each article would take an enormous amount of processing power. Also, the scraper I designed kept running into issues extracting HTML from various sites, returning HTTP error codes.
2) Further work could also be done on how the actual news was classified. I believe my approach was somewhat rudimentary (choosing the price differential one minute after the article was published).
3) I only scratched the surface of the News API. There is a lot of flexibility in what type of news can be scraped, and the topic I chose was limited to Bitcoin. Future work could extend the news search to several different keywords for various cryptocurrencies.
4) The Bitcoin data also covered only one exchange, Coinbase (GDAX). Future work would extend this functionality to multiple exchanges, and even different cryptocurrencies.
5) Finally, no real profit analysis was done on the Naive Bayes prediction model. To turn the sentiment analysis into an actual investment strategy, a trading strategy would need to be developed based on the signals of the classifier, followed by an analysis of the actual P/L.
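As a starting point for the P/L analysis mentioned in the last item, the simplest possible calculation might look like the sketch below. Everything here is hypothetical: the function `signal_pnl`, the rule of trading one Bitcoin per signal, the flat `fee`, and the sample prices. It ignores slippage, position sizing, and overlapping trades.

```python
def signal_pnl(signals, entry_prices, exit_prices, fee=0.0):
    """Per-trade P/L in USD for one Bitcoin per signal.

    signals: 1 to go long on a predicted-positive article,
             -1 to go short on a predicted-negative one.
    """
    return [signal * (exit_price - entry_price) - fee
            for signal, entry_price, exit_price
            in zip(signals, entry_prices, exit_prices)]

# Made-up signals and prices, for illustration only
trades = signal_pnl(
    signals=[1, -1, 1],
    entry_prices=[5000.0, 5010.0, 4990.0],
    exit_prices=[5005.0, 5020.0, 4980.0],
)
total = sum(trades)
```

In this made-up example only the first trade is profitable, and the total comes out negative.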
(1) https://newsapi.org/account
(2) https://bitcoincharts.com/
(3) https://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
(4) https://www.pluralsight.com/courses/building-sentiment-analysis-systems-python